Week 3: Data Visualization

{ggplot2}

Author: Eunji Kong (adapted from Dr. Joe Nese’s lecture)
Affiliation: University of Oregon
Fall 2025

#install.packages("tidyverse")
#install.packages("palmerpenguins")
#install.packages("patchwork")
#install.packages("ggridges")
#install.packages("gghighlight")
#install.packages("MetBrewer")
#install.packages("ggthemes")


#library(tidyverse)
#library(palmerpenguins)
#library(patchwork)
#library(ggridges)
#library(gghighlight)
#library(MetBrewer)
#library(ggthemes)

Greetings!

Learning Objectives

  • Understand the basic syntax requirements for {ggplot2}
  • Recognize various options for displaying data
  • Become familiar with various {ggplot2} options/layers
  • In short, learn how to graph and visualize data

Lecture/Material Structure

  • PDF Lecture Notes

    • Include hyperlinks that take you directly to the relevant topics
      • Hyperlinks: everything underlined
  • .qmd file (recommended)

    • Same information as the PDF but allows you to write notes directly in the file

    • You can also test out code interactively as you follow along

    • You can then render this document as an HTML file for later review

    • Visual mode

{tidyverse}

{tidyverse} is a meta-package that loads a set of core packages

# If you don't have the package installed
# install.packages("tidyverse")

# load library
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

{ggplot2}

  • gg stands for “grammar of graphics”
  • Resources

Components

Every ggplot has three components:

  1. data
    • the data used to produce the plot
  2. aesthetic mappings (aes)
    • between variables and visual properties
  3. layer(s)
    • usually through the geom_*() function plus various other layers

Template

I use base R’s pipe |> instead of %>%; for everything in this lecture, the two behave the same way.

data |> #pipe here
 ggplot(aes(mapping)) + #plus here
 geom_function() +
 additional layers

The code above is equivalent to the code below.

ggplot(data, aes(mapping)) +
geom_function() +
additional layers
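As a concrete instance of the template, the sketch below builds the same scatterplot in both styles and checks that the plotted data are identical. It uses the mpg data that ships with {ggplot2} and its cty and hwy columns purely for illustration; the object names p_pipe, p_nested, and same_points are placeholders.

```r
library(ggplot2)

# Pipe style: the data is passed into ggplot() with |>
p_pipe <- mpg |>
  ggplot(aes(x = cty, y = hwy)) +
  geom_point()

# Nested style: the data is the first argument of ggplot()
p_nested <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# layer_data() returns the data a layer actually draws
same_points <- identical(layer_data(p_pipe), layer_data(p_nested))
same_points
```

Either style works; the pipe version reads top to bottom, which helps once data wrangling steps come before the plot.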

data

# install.packages("palmerpenguins")

library(palmerpenguins)
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
# str(penguins)

# glimpse(penguins)

# colnames(penguins)

# View(penguins)

ggplot(aes(mapping))

  • aesthetic mappings describe how variables in the data are mapped to visual properties

  • Some visual properties include:

    • x

    • y

    • color (will come back to it)

    • fill (will come back to it)

    • alpha (will come back to it)

    • others (linetype, shape, linewidth, size, group)

penguins |> 
  ggplot(aes(x = bill_length_mm, y = body_mass_g))

QUESTION: What do you see? Why is there nothing plotted?

ANSWER: 

Layers

geom_function()

Use a geom_function() to represent data points

                        Only 1 Variable    Continuous Variable 2    Discrete Variable 2
Continuous Variable 1   geom_histogram     geom_point               geom_density_ridges (from {ggridges})
                        geom_density       geom_smooth              geom_boxplot
                                           geom_line                geom_violin
                                                                    geom_col
Discrete Variable 1     geom_bar           x                        geom_count

Other

Heatmap: geom_tile


geom_histogram()

General research question: How do the values of my continuous variable vary across its range?

Our data specific question: What does the distribution of penguin bill lengths (mm) look like in this sample? Are there any outliers? Is it unimodal?

penguins |> 
  ggplot(aes(x = bill_length_mm)) + # Remember to use + instead of |> or %>%
  geom_histogram()

Color vs Fill
penguins |> 
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(color = "blue")

penguins |> 
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(color = "blue", 
                 fill = "green")

Color = outline

Fill = area

Transparency
penguins |> 
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(color = "blue", 
                 fill = "green",
                 alpha = 0.2)

Color, fill, and alpha in this example are all fixed settings (i.e., they apply to all data points).

More aes mapping
penguins |> 
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = sex), # note that fill is inside  aes()
                 alpha = 0.7)

Fill here is a conditional mapping, meaning that the fill color is different based on the variable (in this case the sex of the birds).

Fixed vs Conditional
penguins |> 
ggplot(aes(x = bill_length_mm)) +
geom_histogram(fill = "green")

In the above example, where fill is not within aes(), fill is a fixed setting. Also notice that the color name is in quotes.

penguins |> 
ggplot(aes(x = bill_length_mm)) +
geom_histogram(aes(fill = sex))

In the above example, aes() is used to access variables and make changes according to a specific variable. Here, fill is conditional on the variable sex. Also notice that variables are not in quotes.

a <- penguins |> 
ggplot(aes(x = bill_length_mm)) +
geom_histogram(fill = "green")

b <- penguins |> 
ggplot(aes(x = bill_length_mm)) +
geom_histogram(aes(fill = sex))

#install.packages("patchwork")
library(patchwork)
a+b

Be mindful of aes()

penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(fill = "green")

penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = "green"))

Question: What is wrong with the second block of code? What do you think the plot will look like?

Answer: 
penguins |> 
ggplot(aes(x = bill_length_mm)) +
geom_histogram(aes(fill = "green"))
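To see what {ggplot2} does with each version of fill, the sketch below inspects the fill values that each histogram layer ends up drawing. The object names (base, p_fixed, p_mapped, p_constant, fills_*) are placeholders, and bins = 30 and na.rm = TRUE are added only to silence messages.

```r
library(ggplot2)
library(palmerpenguins)

base <- ggplot(penguins, aes(x = bill_length_mm))

# Fixed setting: the literal colour "green" is applied to every bar
p_fixed <- base + geom_histogram(fill = "green", bins = 30, na.rm = TRUE)

# Conditional mapping: fill depends on the values of the variable sex
p_mapped <- base + geom_histogram(aes(fill = sex), bins = 30, na.rm = TRUE)

# Constant inside aes(): "green" is treated as a one-level category,
# so the bars get a default palette colour and a legend entry labelled "green"
p_constant <- base + geom_histogram(aes(fill = "green"), bins = 30, na.rm = TRUE)

# layer_data() shows the fill values each layer actually uses
fills_fixed    <- unique(layer_data(p_fixed)$fill)
fills_mapped   <- unique(layer_data(p_mapped)$fill)
fills_constant <- unique(layer_data(p_constant)$fill)

fills_fixed
fills_mapped
fills_constant
```

The rule of thumb: variable names go inside aes(), literal values such as color names go outside it.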

geom_density()

General research question: How does the probability density of my continuous variable vary across its range?

Think of it as a smoothed histogram

  • Difference: a density plot does not use bins, and its y-axis shows density (relative frequency per unit of x) rather than counts
penguins |>
  ggplot(aes(x = bill_length_mm)) + 
  geom_density()
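The difference in y-axis units can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original lecture: it sums the histogram bar heights and approximates the area under the density curve with the trapezoid rule. The object names are placeholders.

```r
library(ggplot2)
library(palmerpenguins)

p_hist <- ggplot(penguins, aes(x = bill_length_mm)) +
  geom_histogram(bins = 30, na.rm = TRUE)

p_dens <- ggplot(penguins, aes(x = bill_length_mm)) +
  geom_density(na.rm = TRUE)

# Histogram: bar heights are counts, so they add up to the number of
# non-missing observations
total_count <- sum(layer_data(p_hist)$count)
n_observed  <- sum(!is.na(penguins$bill_length_mm))

# Density: the curve is scaled so that the area under it is about 1
# (trapezoid rule over the points that ggplot2 evaluated)
d <- layer_data(p_dens)
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

c(total_count = total_count, n_observed = n_observed)
area
```

The area comes out slightly below 1 because the curve is only evaluated over the observed range of the data.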

More aes mapping
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex))

Add transparency for clarity

penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5)

Histogram vs Density
a <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = sex), alpha = 0.5)

b <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5)

a + b

Question: What is the difference that you see? When would you use one vs another?

Answer: 

facet_wrap

wrap by 1 variable

penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5) +
  facet_wrap(~sex) # remember to use ~

wrap by 2 variables

penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5) +
  facet_wrap(year~sex)

wrap using vars()

penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5) +
  facet_wrap(vars(year,sex))
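The formula and vars() forms describe the same set of panels. The sketch below checks this and also shows two optional tweaks that are not in the examples above: dropping rows with missing sex via tidyr::drop_na() so no NA panels appear, and the ncol argument of facet_wrap() to control the grid shape. Object names are placeholders.

```r
library(ggplot2)
library(palmerpenguins)

# Drop rows with unknown sex so no "NA" panels or groups appear
penguins_known <- tidyr::drop_na(penguins, sex)

base <- ggplot(penguins_known, aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5)

# Formula interface and vars() interface
p_formula <- base + facet_wrap(year ~ sex)
p_vars    <- base + facet_wrap(vars(year, sex), ncol = 2)  # ncol sets the grid shape

# Count the panels that each version produces
n_panels_formula <- length(unique(layer_data(p_formula)$PANEL))
n_panels_vars    <- length(unique(layer_data(p_vars)$PANEL))

c(n_panels_formula, n_panels_vars)
```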

geom_density_ridges()

geom_density_ridges: two variables

# install.packages("ggridges")

library(ggridges)

penguins |>
  ggplot(aes(bill_length_mm, sex)) +
  geom_density_ridges()

geom_point()

General research question: How are two numeric variables related? (raw observations)

Our data specific question: What is the relationship between penguins’ bill length and body mass?

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point()

Add color

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(color = "magenta")

Emphasize specific data points (island = Torgersen)

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(color = "magenta") +
  geom_point(data = filter(penguins, island == "Torgersen"), color = "blue")

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(data = filter(penguins, island == "Torgersen"), color = "blue") +
  geom_point(color = "magenta")

Question: What happened when we switched the order of the geom_points?

Answer: 

Emphasize another way

# install.packages("gghighlight")



penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(color = "magenta") +
  gghighlight::gghighlight(island == "Torgersen")

geom_smooth()

General research question: What is the pattern of the relationship between two continuous variables? (trend)

Our data specific question: What is the trend or pattern of the relationship between penguins’ bill length and body mass?

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) + 
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Method
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm")

No need to include “x =” or “y =” because aes() assumes the first argument is x and the second is y.

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm", level = .65)

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm", se=FALSE)

Note: This is not the same as geom_line(). We are fitting a line of best fit with geom_smooth()
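To make "line of best fit" concrete, the sketch below compares the line drawn by geom_smooth(method = "lm") with predictions from a model fitted directly with lm(). This is an illustration added under the assumption that the default formula y ~ x is used; the object names are placeholders.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# The same model fitted directly; lm() drops rows with missing values
fit <- lm(body_mass_g ~ bill_length_mm, data = penguins)

# Points along the line that geom_smooth() draws
smooth_line <- layer_data(p)

# Predictions from lm() at the same x positions
manual <- predict(fit, newdata = data.frame(bill_length_mm = smooth_line$x))

# Largest difference between the two sets of fitted values
max_gap <- max(abs(unname(manual) - smooth_line$y))
max_gap
```

The gap is zero up to rounding error, so the smooth line is the ordinary least-squares regression line, and the shaded band is its confidence interval.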

Adding Layers

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")

Global

If we map an aesthetic such as color = species inside aes() in the ggplot() call, every additional layer inherits that mapping.

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) + # color = species
  geom_point() +
  geom_smooth(method = "lm")

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", aes(color = species))

Local
penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + # color = species
  geom_smooth(method = "lm")
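One way to see the effect of global versus local mappings is to count how many regression lines geom_smooth() fits in each case. The sketch below does this with layer_data(); the object names are placeholders, and formula = y ~ x and na.rm = TRUE are added only to silence messages.

```r
library(ggplot2)
library(palmerpenguins)

# Global mapping: color = species is inherited by both layers
p_global <- ggplot(penguins, aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# Local mapping: color = species applies to the points only
p_local <- ggplot(penguins, aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species), na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# Layer 2 is geom_smooth(); count how many separate lines it fits
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))

c(global = n_lines_global, local = n_lines_local)
```

With the global mapping there is one line per species; with the local mapping there is a single line for all penguins.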

geom_line()

  • geom_point: raw observations, not linked

  • geom_smooth: trend/pattern

  • geom_line: raw data linked

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm") + 
  geom_line()

When should you use line plots?

  • Usually when time is involved

  • One data point per time point for each line or group

  • Shows linkage

# Original data
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
# Create new data set so that there is only one data point per year
penguins_year <- penguins |>
  group_by(year) |> 
  summarize(avg_bill = mean(bill_length_mm, na.rm=TRUE))
head(penguins_year)
# A tibble: 3 × 2
   year avg_bill
  <int>    <dbl>
1  2007     43.7
2  2008     43.5
3  2009     44.5
penguins_year |>
  ggplot(aes(year, avg_bill)) + 
  geom_line()

# Create new data set so that there is one data point for each year for each island
penguins_year_species <- penguins |> 
  group_by(year, island) |> 
  summarize(avg_bill = mean(bill_length_mm, na.rm=TRUE))

head(penguins_year_species)
# A tibble: 6 × 3
# Groups:   year [2]
   year island    avg_bill
  <int> <fct>        <dbl>
1  2007 Biscoe        45.0
2  2007 Dream         44.5
3  2007 Torgersen     38.8
4  2008 Biscoe        44.6
5  2008 Dream         43.8
6  2008 Torgersen     38.8
penguins_year_species |>
  ggplot(aes(year, avg_bill, group = island, color = island)) + 
  geom_line()
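A few optional refinements to the line plot, sketched under my own assumptions rather than taken from the lecture: .groups = "drop" returns an ungrouped summary, geom_point() marks the actual yearly means, and scale_x_continuous(breaks = ...) keeps the x-axis to whole years. The names island_year and p are placeholders.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# .groups = "drop" returns an ungrouped tibble after summarizing
island_year <- penguins |>
  group_by(year, island) |>
  summarize(avg_bill = mean(bill_length_mm, na.rm = TRUE), .groups = "drop")

# Mapping color to a discrete variable also defines the groups,
# so a separate group = island is not required here
p <- island_year |>
  ggplot(aes(year, avg_bill, color = island)) +
  geom_line() +
  geom_point() +                           # mark the actual yearly means
  scale_x_continuous(breaks = 2007:2009)   # whole years only on the x-axis

p
```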

geom_boxplot()

General research question: How is a continuous variable distributed across groups, and how do the medians, quartiles, and potential outliers compare?

penguins |>
  ggplot(aes(species, body_mass_g)) +
  geom_boxplot()

geom_violin()

General research question: How is the full distribution of a continuous variable shaped across groups?

penguins |>
  ggplot(aes(species, body_mass_g)) +
  geom_violin()

geom_bar()

geom_bar() vs geom_col()
geom_bar()

  • 1 discrete variable

  • counts rows

  • height of the bar is proportional to the number of cases in each group

geom_col()

  • 2 variables

    • 1 continuous

    • 1 discrete (unique)

  • need to have a variable with numbers in your data (average, proportion, count)

penguins |> 
  ggplot(aes(species)) + # one variable in the `aes()`
  geom_bar()

geom_col()

summarized_penguins <- penguins |> 
  group_by(species) |> 
  summarize(N = n())

head(summarized_penguins)
# A tibble: 3 × 2
  species       N
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
summarized_penguins |>
  ggplot(aes(species, N)) +
  geom_col()
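The two geoms can produce the same picture. The sketch below checks that the bar heights from geom_bar() on the raw data match those from geom_col() on a pre-counted table. It uses dplyr::count() as a shortcut for group_by() plus summarize(N = n()); the object names are placeholders.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# geom_bar(): counts the rows for each species itself
p_bar <- ggplot(penguins, aes(species)) +
  geom_bar()

# geom_col(): we count first, then plot the counts as given
species_n <- count(penguins, species, name = "N")
p_col <- ggplot(species_n, aes(species, N)) +
  geom_col()

# Compare the bar heights that each layer draws
bar_heights <- layer_data(p_bar)$count
col_heights <- layer_data(p_col)$y

rbind(bar_heights, col_heights)
```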

More aes mapping
summarized_penguins2 <- penguins |>
  group_by(species, sex) |>
  na.omit() |> 
  summarize(bill_length_avg = mean(bill_length_mm))

summarized_penguins2
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    bill_length_avg
  <fct>     <fct>            <dbl>
1 Adelie    female            37.3
2 Adelie    male              40.4
3 Chinstrap female            46.6
4 Chinstrap male              51.1
5 Gentoo    female            45.6
6 Gentoo    male              49.5
ggplot(summarized_penguins2, aes(species, bill_length_avg)) +
  geom_col(aes(fill = sex))

Position
ggplot(summarized_penguins2, aes(species, bill_length_avg)) +
  geom_col(aes(fill = sex), position = "dodge")

coord_flip
ggplot(summarized_penguins2, aes(species, bill_length_avg)) +
  geom_col(aes(fill = sex), position = "dodge") +
  coord_flip()

geom_count()

General research question: How many observations fall in each category pair?

Our data specific question: How many of each species live on each island?

penguins |>
  ggplot(aes(species, island)) +
  geom_count()

Scales
penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n)))+
  scale_color_gradient(low = "lightblue", high = "brown")

What do scales do?

Scales control how the mappings you added to aes are displayed (e.g., color range, size range, breaks and labels, range or limits)

Template: scale_*

Available for most aes mappings: x, y, size, color, fill, linetype, alpha, etc.
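Scale function names follow the pattern scale_<aesthetic>_<type>. As a minimal sketch of that pattern, the example below sets the colors for a discrete mapping with scale_color_manual() and the axis breaks with scale_x_continuous(). The specific colors and breaks are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- penguins |>
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point(na.rm = TRUE) +
  # scale for the color aesthetic, values chosen by hand
  scale_color_manual(values = c(Adelie    = "darkorange",
                                Chinstrap = "purple",
                                Gentoo    = "cyan4")) +
  # scale for the x aesthetic, a continuous variable
  scale_x_continuous(breaks = seq(30, 60, by = 10))

# The colors that the point layer ends up using
point_colors <- unique(layer_data(p)$colour)
point_colors
```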

Colorblind friendly

penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n)))+
  scale_color_viridis_c()

penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n)))+
  scale_color_viridis_c(option = "turbo") # magma, inferno, plasma, viridis, cividis, rocket, mako, turbo or A-H

#install.packages("MetBrewer")

penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n)))+
  scale_color_gradientn(colors=MetBrewer::met.brewer("Isfahan1"))

geom_tile()

General research question: What is the value of a numerical measure (Z) for each (X, Y) pair? In our example, Z is the correlation between each pair of variables.

corr <- penguins |>
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  drop_na() |>                
  cor()

pc <- corr |> 
  as.data.frame() |> 
  rownames_to_column(var = "row") |> 
  pivot_longer(
    cols = -row,
    names_to = "col",
    values_to = "cor")

head(pc)
# A tibble: 6 × 3
  row            col                  cor
  <chr>          <chr>              <dbl>
1 bill_length_mm bill_length_mm     1    
2 bill_length_mm bill_depth_mm     -0.235
3 bill_length_mm flipper_length_mm  0.656
4 bill_length_mm body_mass_g        0.595
5 bill_depth_mm  bill_length_mm    -0.235
6 bill_depth_mm  bill_depth_mm      1    
ggplot(pc, aes(row, col, fill = cor)) +
  geom_tile()

ggplot(pc, aes(row, col, fill = cor)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(pc, aes(row, col, fill = cor)) +
  geom_tile() +
  scale_fill_viridis_c()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
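Because correlations range from -1 to 1, a diverging scale centered at 0 can make the sign easier to read. The sketch below is one possible variation, not the lecture's method: it rebuilds pc, uses scale_fill_gradient2() with arbitrary colors, and prints the rounded correlation in each tile with geom_text().

```r
library(tidyverse)
library(palmerpenguins)

# Correlation matrix reshaped to long format, as above
pc <- penguins |>
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  drop_na() |>
  cor() |>
  as.data.frame() |>
  rownames_to_column(var = "row") |>
  pivot_longer(cols = -row, names_to = "col", values_to = "cor")

p <- ggplot(pc, aes(row, col, fill = cor)) +
  geom_tile() +
  geom_text(aes(label = round(cor, 2))) +   # print the value in each tile
  scale_fill_gradient2(low = "steelblue", mid = "white", high = "firebrick",
                       midpoint = 0, limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

p
```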


Other Layers

labels

axis labels
penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) 

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)")

title
penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass")

subtitle
penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle="Grouped by species")

caption
penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle="Grouped by species",
       caption = "palmerpenguins")

tag
penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle="Grouped by species",
       caption = "palmerpenguins",
       tag = "(A)")

legend (one way)
penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle="Grouped by species",
       caption = "palmerpenguins",
       tag = "(A)",
       color="SPECIES!")


theme

The default is theme_gray(). There are a lot of built-in alternatives in {ggplot2}. My go-to is theme_minimal() because it is clean without a lot of unnecessary visuals.

If you want to set theme globally (meaning to all your graphs in your document), add theme_set(theme_minimal()) to the first line after you load your libraries.
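A minimal sketch of setting the theme globally, using the mpg data from {ggplot2} for illustration. theme_set() returns the previously active theme, so you can store it and restore it later; the names old_theme, p, and restored are placeholders.

```r
library(ggplot2)

# theme_set() returns the theme that was active before the change
old_theme <- theme_set(theme_minimal())

# Plots drawn while this default is active use theme_minimal()
# without adding it to each plot
p <- ggplot(mpg, aes(cty, hwy)) +
  geom_point()

# Restore the previous default when you are done
theme_set(old_theme)
restored <- identical(theme_get(), old_theme)
restored
```

Adding a theme_*() layer to an individual plot still overrides the global default for that plot.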

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle="Grouped by species",
       caption = "palmerpenguins",
       tag = "(A)",
       color="SPECIES!") +
  theme_minimal()

Other packages:

#install.packages("ggthemes")

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  ggthemes::theme_economist() +
  ggthemes::scale_color_economist()

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x        = "Bill length (mm)",
       y        = "Body mass (g)",
       title    = "Relationship between bill length and body mass",
       subtitle = "Grouped by species",
       caption  = "palmerpenguins",
       tag      = "(A)",
       color    = "SPECIES!") +
  theme(plot.title        = element_text(size=13, face="bold", hjust =0.5), 
        axis.title        = element_text(size=11, family="Georgia"), 
        axis.text.x       = element_text(size=10, angle = 45, hjust=1),
        panel.background  = element_rect(fill = "grey95"),
        plot.background   = element_rect(fill = "white"),
        panel.grid.major  = element_line(color = "black"),
        panel.grid.minor  = element_blank(),
        legend.position   = "top",
        legend.title      = element_text(face="bold"),
        legend.background = element_rect(fill = "transparent")) 


Practice together

  1. Get to know the data: str(mpg) or head(mpg)
  2. What is the overall distribution of city fuel efficiency (mpg) across car models?
  3. How does the distribution vary by drivetrain type (e.g., front-, rear-, 4-wheel drive)?
  4. What is the relationship between city and highway mpg?
  5. Can we focus on/emphasize the relationship for Audi?
  6. Can we have larger points for clarity?
  7. How do the city/hwy mpg relationships differ by car class?
  8. Too much clutter. Can we just see trends?
  9. Still too much clutter. Is there a better way to see each trend clearly?
  10. Can we make it colorblind friendly?
  11. Can we clarify axis and legend labels?
  12. Can we polish the appearance with a theme?